Language-oriented user interfaces for voice activated services

ABSTRACT

A comprehensive system is provided for designing a voice activated user interface (VA UI) having a semantic and syntactic structure adapted to the culture and conventions of spoken language for the intended users. The system decouples the content dimension of speech (semantics) and the manner-of-speaking dimension (syntax) in a systematic way. By decoupling these dimensions, the VA UI can be optimized with respect to each dimension independently and jointly. The approach is general across languages and encompasses universal variables of language and culture. Also provided are voice activated user interfaces with semantic and syntactic structures so adapted, as well as a prompting grammar and error handling methods adapted to such user interfaces.

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is a divisional of U.S. application Ser. No. 09/456,922, filed Dec. 7, 1999, now allowed.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] The present invention relates to user interfaces for voice actuated services. In particular, the present invention relates to user interfaces specifically adapted to the spoken language of the target users. The present invention specifically provides both language-oriented user interfaces and generally applicable systems and methods for building such language-oriented user interfaces.

[0004] 2. Description of the Related Art

[0005] A user interface is a component or tool of a computer system that enables a user to interact with the computer system, either to issue instructions controlling the operation of the system, enter data, examine results, or perform other operations in connection with the functions of the system. In effect, the user interface is the computer's “cockpit.” That is, the user interface presents information about the computer's operation to the user in an understandable form, and it enables the user to control the computer by converting the user's instructions into forms usable by the computer. Various types of user interfaces exist, such as text (or “command line”) interfaces, graphical user interfaces (“GUIs”), Dual Tone Multi-Frequency (DTMF) interfaces, and others.

[0006] “Voice activated” (VA) or “voice controlled” (VC) user interfaces are a promising alternative type of user interface that enables users to interact with the computer by spoken words. That is, rather than typing in text commands, pressing numbers on a telephone keypad, or “clicking” on graphical icons and menu items, the user provides instructions and data to the computer merely by speaking appropriate words. The ability of a user interface to receive inputs by voice signals has clear advantages in many application areas where other means of input (keyboard, telephone keypad, mouse or other pointing device, etc.) are unavailable or unfamiliar to the user.

[0007] Unfortunately, voice activated user interfaces (“VA UIs”) have generally failed to provide the level of usability necessary to make such devices practical in most application areas. This failure has been due in part to inherent technical challenges, such as the difficulty of reliably converting spoken words into corresponding computer instructions. However, continuing advances in acoustic signal recognition (ASR) technologies have largely removed such obstacles. The persistent inadequacies of existing VA UIs therefore arise from design flaws in the UIs themselves, rather than lack of adequate implementing technology.

[0008] Currently, voice activated user interfaces (VA UIs) are designed and implemented in an ad hoc manner. Most developers overlay a voice-activated UI onto a dual-tone multiple frequency (DTMF) UI and perform after-the-fact testing on the integrated unit. Tests of these systems are therefore performed without consideration of the change in input modality (spoken versus DTMF keypresses) or of the new usability effects generated by the coupling between the various submodules of the system.

[0009] Trial and error is the most common approach for VA UI design and development. The vocabulary wordset for the service is often the literal translation of the English command words used for the task into the target language. Two typical prompting structures are (1) to list out all the options at once and wait for the subscriber to speak the choice (either at the end or by barging-in), or (2) to say the options one at a time, and provide a pause or yes/no question to signal the subscriber to make a choice. Textual (visual) UIs essentially follow the first approach, while DTMF UIs use the second approach. Explicit turn-taking is generally signalled by introducing a tone to indicate that the subscriber should speak.

[0010] However, to serve the needs of users effectively, a VA UI must have characteristics and must satisfy ease-of-use requirements different from those of a DTMF or visual/textual UI. The need for these differences arises because verbal dialogues are dynamic social interactions and differ across languages and cultures in ways that are not paralleled in visual or written interactions. To have any practical significance, therefore, a VA UI must flexibly accommodate different command words, tempos in which they are spoken, and ways in which turn-taking is signaled in the language in which the human-machine conversation is taking place. Put another way, designing a VA UI to be more than a technical curiosity requires more than simply adding (overlaying, substituting) command words to a DTMF service. All users, whether first-time, average, or experienced, must find the UI highly acceptable and easy to use.

[0011] On the other hand, it has been the accepted wisdom that present-day software technology is too rudimentary to make possible user interfaces that are actually easy to use. U.S. Pat. No. 5,748,841, issued May 5, 1998, to Morin et al., expresses this view as follows: “In one respect, the problem may be that even complex computer applications and computer programs do not provide the flexible input/output bandwidth that humans enjoy when interacting with other humans. Until that day arrives, the human user is relegated to the position of having to learn or acquire a precise knowledge of the language that the computer application can understand and a similar knowledge of what the computer application will and will not do in response. More precisely, the human user must acquire a knowledge of enough nuances of the application language to allow the user to communicate with the application in syntactically and semantically correct words or phrases.”

[0012] Thus, the state of the art in user interface technology has explicitly assumed that effective use of a practical user interface requires the user to learn the syntax and semantics that are employed by the user interface. There has existed an unmet need for a user interface adapted to the conventions of the user's spoken language. Heretofore this need has actually been considered to be unmeetable with existing software technology. This need has been particularly acute for voice activated user interfaces, because the conventions of spoken language vary much more widely between different communities than the conventions of written language. Furthermore, voice activated services may have greatest potential for growth among users with little computer experience, provided usable VA UIs that follow universal spoken language principles become available.

SUMMARY OF THE INVENTION

[0013] It is an object of the present invention to provide a method of designing language-oriented user interfaces for voice activated services.

[0014] The present invention provides, in a first aspect, a method for designing a voice activated user interface, the method comprising separately selecting a vocabulary set and a prompting syntax for the user interface based on results of first testing with subjects from a target community. The method further comprises jointly optimizing the vocabulary set and the prompting syntax based on results of second testing with subjects from the target community.

[0015] In a second aspect, the invention provides a method for selecting a vocabulary set for a voice activated user interface. The method of this aspect comprises collecting responses to task-oriented questions eliciting commonly used names for tasks and task-related items, and selecting a plurality of responses from the collected responses based on frequency of occurrence in the collected responses.

[0016] In a third aspect, the invention provides a computer system and computer software providing a service through a voice activated user interface. The computer system comprises a storage and a processor. The storage has a vocabulary of command words stored therein, each command word being selected from responses to questions posed to members of a test group. The processor interprets a spoken response based on the stored command words. The computer software comprises instructions to perform the corresponding operations.

[0017] In a fourth aspect, the invention provides a method for defining a prompting syntax for a voice actuated user interface. The method of this fourth aspect comprises identifying an initial value for each of one or more syntax parameters from samples of dialogue in a conversational language of a target community. The method further comprises specifying an initial temporal syntax for the user interface based on the one or more identified initial values.

[0018] In a sixth aspect, the invention provides a method for optimizing a prompting syntax of a voice actuated user interface, the method comprising testing performance of tasks by subjects from a target community using the interface implemented with a command vocabulary and a temporal syntax each selected for the target community. The method of this aspect further comprises modifying the temporal syntax based on results of the testing.

[0019] In a seventh aspect, the invention provides a method for defining a prompting syntax for a voice activated user interface, the method comprising specifying an initial temporal syntax for the user interface based on initial syntax parameter values identified through dialogue analysis. The method of this aspect also comprises modifying the initial temporal syntax based on results of testing user performance with the user interface using a selected command vocabulary with the initial temporal syntax.

[0020] In an eighth aspect, the invention provides a method for optimizing a voice activated user interface, the method comprising configuring the user interface with a vocabulary of command words including at least one word indicating a corresponding task and selected from plural words for the task based on frequency of use. The method of this aspect also comprises changing at least one of a command and a syntax parameter of the user interface based on results of testing the user interface with speakers of a target language.

[0021] In a ninth aspect, the invention provides a method for adaptive error handling in a voice activated user interface. The method comprises detecting that an error has occurred in a dialogue between the user and the user interface based on a change in behavior of the user. The method further comprises reprompting the user when the error is an omission error, and returning to a previous menu state responsive to a correction command by the user when the error is a commission error.
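The two branches of this aspect can be pictured with a minimal Python sketch. It assumes that error detection has already happened and that the error kind is passed in as a string; the `Dialogue` class, its menu stack, and the returned action strings are hypothetical names introduced only for illustration and are not taken from the described system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dialogue:
    """Hypothetical dialogue state: a stack of menu states, current menu on top."""
    menu_stack: List[str] = field(default_factory=lambda: ["main"])

    def enter(self, menu: str) -> None:
        """Move into a new menu state."""
        self.menu_stack.append(menu)

    def handle_error(self, kind: str, correction_command: bool = False) -> str:
        """Apply one branch of the handling described above.

        kind is "omission" or "commission" (assumed to be detected elsewhere).
        """
        if kind == "omission":
            # Omission error: stay in the current menu and prompt again.
            return f"reprompt:{self.menu_stack[-1]}"
        if kind == "commission" and correction_command:
            # Commission error plus a correction command: go back one menu state.
            if len(self.menu_stack) > 1:
                self.menu_stack.pop()
            return f"returned:{self.menu_stack[-1]}"
        # Anything else is left to the caller in this sketch.
        return "no_action"
```

Keeping the menu states on a stack is only one way to make "a previous menu state" recoverable; the text does not say how the state history is stored.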

[0022] In a tenth aspect, the invention provides a method for adaptive error handling in a voice activated user interface. The method of this aspect comprises detecting that an error has occurred in a dialogue with the user interface following a prompt delivered according to a first prompting structure, and reprompting the user according to a second prompting structure when a count of errors exceeds a predetermined value.

[0023] In an eleventh aspect, the invention provides a method for adaptive error handling in a voice activated user interface, the method comprising selecting an error prompt level based on an accumulated number of user errors when a user error occurs in a dialogue between the user interface and a user. The method of this aspect further comprises reprompting the user according to the selected error prompt level.
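As a rough illustration of selecting a prompt level from an accumulated error count, the following Python sketch maps the count to an index into a list of increasingly explicit reprompts. The prompt wording, the number of levels, and the one-error-per-level mapping are assumptions made for this example; the text above does not specify them.

```python
# Hypothetical reprompt texts, ordered from terse to fully explicit.
ERROR_PROMPTS = [
    "Sorry?",
    "Sorry, I did not understand. Please say a command.",
    "You can say: play, erase, or skip. Please say one of these now.",
]


def select_error_prompt_level(accumulated_errors: int) -> int:
    """Map the accumulated error count (1 or more) to a prompt level.

    Assumption: each additional error raises the level by one until the
    most explicit level is reached.
    """
    if accumulated_errors < 1:
        raise ValueError("called only after at least one error has occurred")
    return min(accumulated_errors - 1, len(ERROR_PROMPTS) - 1)


def reprompt(accumulated_errors: int) -> str:
    """Return the reprompt text for the selected error prompt level."""
    return ERROR_PROMPTS[select_error_prompt_level(accumulated_errors)]
```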

[0024] In a twelfth aspect, the invention provides a computer system and computer software providing a service to a user through a voice activated user interface. The computer system comprises a storage and a processor. The storage stores a menu of commands usable by the user in a dialogue between the user and the user interface. The processor detects an error in the dialogue based on a change in behavior of the user, reprompts the user when the error is an omission error, and returns to a previous menu state responsive to a correction command when the error is a commission error.

[0025] In a thirteenth aspect, the invention provides a computer system and software providing a service to a user through a voice activated user interface, the computer system comprising a storage and a processor. The storage stores a menu of commands usable by the user in a dialogue between the user and the user interface. The processor prompts a command selection by the user according to a first prompting style, detects an error in the dialogue when the error occurs, and prompts a command selection by the user according to a second prompting style when a count of errors by the user during the dialogue exceeds a predetermined value.

[0026] In a fourteenth aspect, the invention provides a method for prompting a user of a voice activated user interface. The method of this aspect comprises pausing for a first predetermined interval after presentation of a label identifying a current menu state of the user interface. The method further comprises presenting to the user a command option for the current menu state only when a command is not received from the user during the predetermined interval.
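The pause-then-offer behavior of this aspect can be sketched in Python as follows. The `say` and `listen` callables stand in for speech output and recognition and are hypothetical, as are the timing defaults and the final listening window after the options are spoken.

```python
from typing import Callable, Optional, Sequence


def prompt_menu(
    label: str,
    options: Sequence[str],
    say: Callable[[str], None],
    listen: Callable[[float], Optional[str]],
    first_pause_s: float = 1.0,      # assumed value for the first interval
    response_window_s: float = 3.0,  # assumed window after the options
) -> Optional[str]:
    """State the menu label, pause, and offer options only if nothing is said.

    listen(timeout) is assumed to return a recognized command, or None if
    the timeout elapses without one.
    """
    say(label)
    command = listen(first_pause_s)
    if command is not None:
        # The user answered during the pause, so the options are never spoken.
        return command
    for option in options:
        say(option)
    return listen(response_window_s)
```

A user who already knows the command hears only the menu label, while a user who stays silent through the pause is given the options.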

[0027] In a fifteenth aspect, the invention provides a method for developing an automatic speech recognition (ASR) vocabulary for a voice activated service. The method comprises posing, to at least one respondent, a hypothetical task to be performed and asking each of the at least one respondent for a word that the respondent would use to command the hypothetical task to be performed. The method of this aspect further comprises receiving, from each of the at least one respondent, a command word, developing a list of command words from the received command word, and rejecting the received command word if the received command word is acoustically similar to another word in the list of command words.
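The accept-or-reject loop of this aspect can be sketched in Python under one loud assumption: spelling similarity, computed with the standard library's `difflib.SequenceMatcher`, is used here as a crude stand-in for acoustic similarity. A real system would compare phonetic or acoustic representations, which the text above does not detail. The threshold value is likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def too_similar(a: str, b: str, threshold: float = 0.75) -> bool:
    """Stand-in similarity check based on spelling, NOT on acoustics."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def build_command_list(received_words: Iterable[str], threshold: float = 0.75) -> List[str]:
    """Add each received word unless it is too similar to a word already listed."""
    accepted: List[str] = []
    for word in received_words:
        if any(too_similar(word, kept, threshold) for kept in accepted):
            continue  # rejected: confusable with an existing command word
        accepted.append(word)
    return accepted
```

Because words are processed in the order received, an earlier word always wins over a later, similar one in this sketch; ordering the input by response frequency first would favor the more common word.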

[0028] Additional objects and advantages of the invention will be set forth in part in the following description and, in part, will be obvious therefrom or may be learned by practice of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

[0029] Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention, will become apparent and more readily appreciated from the following description of the preferred embodiments, taken in conjunction with the accompanying drawings of which:

[0030] FIG. 1 is a block diagram illustrating a general context for and several embodiments of the present invention;

[0031] FIG. 2 shows an overview flow diagram of a method provided by the present invention;

[0032] FIG. 3 shows a more detailed flow of a method for vocabulary selection provided by the present invention;

[0033] FIG. 4 shows a chart of command sub-menus and command functions for an exemplary voice controlled voice mail service;

[0034] FIG. 5 shows a table of exemplary vocabulary testing questions adapted for use with various aspects of the present invention;

[0035] FIG. 6 shows a flow diagram illustrating a method of selecting an initial temporal syntax as provided by the present invention;

[0036] FIGS. 7A and 7B respectively show a template of a prompt grammar provided by an aspect of the present invention and an example prompt grammar for the illustrated template;

[0037] FIG. 8 shows a flow diagram illustrating a prompting method provided by the present invention;

[0038] FIG. 9 shows a flow diagram illustrating a secondary prompting structure provided by the present invention;

[0039] FIG. 10 shows a flow diagram illustrating an error handling method provided by the present invention;

[0040] FIG. 11 shows a flow diagram illustrating another error handling method provided by the present invention;

[0041] FIG. 12 shows a flow diagram illustrating a method for adaptive prompting levels as provided by the present invention; and

[0042] FIG. 13 shows a block diagram illustrating a general error handling procedure of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0043] Reference will now be made in detail to the presently preferred embodiments of the invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout.

[0044] Overview

[0045] FIG. 1 illustrates a computer system 1 that provides both a general context for and several selected embodiments of the present invention. System 1 may itself provide a useful service to users, or system 1 may constitute a “front end” through which users communicate with another system coupled to system 1, such as computer system 3.

[0046] Computer system 1 includes a storage 8, which may be a mass storage device (such as a magnetic or optical disk drive), a memory device, or other suitable data storage device. A processor 6 uses programs and data retrieved from storage 8 to provide a VA UI 10 through which a user (not shown) can interact with computer system 1. The user may provide inputs to system 1 through a sound conversion device such as microphone 12. Typically responses or other information may be output to the user through a sound generating device such as loudspeaker 16, which preferably generates synthesized or recorded voice sounds.

[0047] The VA UI 10 is preferably implemented by a software program running on processor 6 and conceptually illustrated in FIG. 1 as a dashed box including on the one hand a command vocabulary stored in the storage 8, and on the other hand a process running on the processor 6. The process, labeled “MENU STATES⊕TIMING” in FIG. 1, defines menu states for the VA UI 10 and timing for the flow of a dialogue between a user and the VA UI 10. Alternatively, VA UI 10 may be implemented in special purpose circuits that may be composed of integrated circuits or discrete components. Computer system 1 may be used by the user, through interactions with the VA UI 10, to obtain services or to perform tasks. These services may be performed by other software programs running on processor 6 or by one or more other processors (not shown) included in computer system 1. Alternatively, the services or task performance may be provided by any of peripheral devices 16, 18, etc., which may be included in computer system 1, or by computer system 3 in communication with computer system 1.

[0048] The present invention embodies novel and unusual concepts for designing a voice activated interface such as VA UI 10. Heretofore there have existed few de facto guidelines for design and development of a VA UI. Consistent with the fact that few services and deployments exist, all of the existing principles have been ad hoc in nature and narrow in scope. The user has been expected to adopt the vocabulary of the UI, without any recognition that the user might naturally choose different words to designate given tasks. Further, there has been a failure to consider explicitly the possibility of dialog management through verbal (or implicit) “turn taking,” in which an opportunity for response is signaled by the manner of speaking, and a response is anticipated. Even more so, the existing approaches have failed to recognize the effects on VA UI performance of variations in social interactions from country to country, or even from region to region within a country.

[0049] The present invention proceeds from the realization that an effective VA UI should be designed to account for two complementary aspects of spoken dialogue that roughly correspond to the linguistic concepts of semantics and syntax. These paired concepts appear in a dialogue as content and manner of speaking, and they correspond to the functional characteristics of parallel association and temporal linearity. Hence “verbal semantics,” or simply “semantics,” will here encompass what the words mean and when the meaning of a concept is understood.

[0050] “Verbal syntax,” or simply “syntax,” includes the temporal structure underlying the sequence of spoken words and the grammatical relationships between the words.

[0051] The invention provides a universal framework that expressly accounts for the distinct aspects of semantics and syntax in a VA UI. The invention also provides a mechanism for explicitly accommodating cross-cultural spoken language variations in verbal communication. The semantics of the VA UI can be designed to incorporate commonly used words in the spoken language of the intended users. The specific language variant as spoken by the expected user population for the service will be called the “target language.”

[0052] Further, the invention allows the VA UI to incorporate the syntactic conventions particular to the language and culture of the expected users. The community of expected users will be called the “target community.” A “conversational language” of the target community is a language habitually used by members of the target community for routine conversations such as casual talk, routine purchases or business transactions, and so forth. Typically the target language of the VA UI will be such a conversational language of the target community.

[0053] A key discovery embodied in the present invention is that the design of different components of a VA UI can proceed separately. That is, it has been found that the design process for a VA UI can be “decoupled” based on linguistic universals as applied to spoken language. The decoupled components are defined and refined separately, and then combined in the task domain for integrated optimization. The UI design, testing and modification processes of the present invention focus on the means to decouple content (semantics) and manner (syntax) in a systematic way. The recognition that such a decoupling is possible, and implementation of this decoupling in a structured methodology, permits significant improvement in performance of the resulting VA UI.

[0054] FIG. 2 shows a conceptual diagram of a VA UI design process of the present invention. The first step is to decouple UI semantics and syntax, to the degree possible. Definition of the call flows for the target application is conceptually represented by block 20. The analysis of semantics and syntax are then “decoupled” by following separate design tracks for vocabulary and temporal structure, respectively. These separate design tracks can be implemented either serially or in parallel.

[0055] Block 30 of FIG. 2 represents the semantics design track, which encompasses vocabulary testing and selection of a language-specific preferred vocabulary set. These procedures will be discussed in detail below with reference to FIG. 3. Block 60 represents the syntax design track, which corresponds to proposing an initial structure for temporal testing on the sequences of temporal operations leading to selection of initial syntax parameters for specification of an initial language-specific syntax structure. Whereas the vocabulary testing track centers around a question-and-answer paradigm to elicit information relating to word content, the syntax testing track of block 60 centers around a paradigm of eliciting spoken “sentences” from the test subjects. In this context, a “sentence” may be a grammatically correct sentence, a phrase, a series of phrases, or any other fragment of spoken language for which the temporal structure may be characteristic of spoken conversation in the target community. Procedures for syntax specification will be discussed in detail with reference to FIG. 6.

[0056] Block 70 represents the integration stage of the design process, where the separate vocabulary set and syntax structure are combined into an integrated language-specific dialogue structure for the UI and tested against performance criteria. Block 80 represents the optimization stage of the design process, where the integrated dialogue structure is modified based on the results of the performance testing.

[0057] The customization of the syntax for a target language begins with an analysis of conversational manner, which then permits the specification of the initial temporal syntax for the dialogue. The goal is to identify a syntactical structure incorporating language-specific temporal features, such as pausing and pacing that provide turn-taking cues, and placing them into a temporal template, defined by temporal rules for that grammar.

[0058] The invention also embodies the discovery of a general prompt grammar (or syntactical template) that is particularly effective for VA UIs, and a method for prompting users of a voice-activated UI. The method includes a first embodiment in which a menu name is stated to set a context, a first pause for rapid response is provided, and then several sets of menu selections are offered to the user in succession. Each set of menu selections is a conceptual “chunk” of 2-4 choices. The chunk size, although conventionally thought to be a memory-dependent constant, is here considered to be a culturally-dependent variable.
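The prompt template described here (menu name, first pause, then chunks of 2-4 choices) can be laid out as a simple script of "say" and "pause" steps. The Python sketch below is illustrative only: the pause durations, the pause placed after each chunk, and the tuple-based script format are assumptions, and the chunk size is exposed as a parameter because the text treats it as a culturally dependent variable.

```python
from typing import List, Sequence, Tuple, Union

Step = Tuple[str, Union[str, float]]  # ("say", text) or ("pause", seconds)


def chunk(options: Sequence[str], size: int) -> List[List[str]]:
    """Split options into consecutive chunks; the last chunk may be shorter."""
    if not 2 <= size <= 4:
        raise ValueError("chunk size is expected to be 2-4 choices")
    return [list(options[i:i + size]) for i in range(0, len(options), size)]


def build_prompt_script(
    menu_name: str,
    options: Sequence[str],
    chunk_size: int = 3,          # assumed default; tuned per target community
    first_pause_s: float = 1.0,   # assumed duration of the rapid-response pause
    chunk_pause_s: float = 0.75,  # assumed pause after each chunk
) -> List[Step]:
    """Return the ordered say/pause steps for one menu prompt."""
    script: List[Step] = [("say", menu_name), ("pause", first_pause_s)]
    for group in chunk(options, chunk_size):
        script.append(("say", ", ".join(group)))
        script.append(("pause", chunk_pause_s))
    return script
```

For example, `build_prompt_script("Main menu", ["play", "erase", "skip", "save", "reply"])` yields the menu name, a pause, the chunk "play, erase, skip", a pause, the chunk "save, reply", and a final pause.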

[0059] With initial semantic and syntactic structures defined, the next step is to combine these structures into a “prompting structure.” Here the term “prompting structure” will refer to an integrated dialogue structure composed of a semantically appropriate vocabulary word set implemented with a language-specific temporal syntax. The prompting structure is then optimized as a unit. The present invention provides a method for optimizing the customized semantics and the initial syntax in combination, thereby to fine-tune the syntax and optimize the usability of the VA UI. This approach allows the integrated prompting structure to be fully adapted to the speech conventions of a particular language and culture.

[0060] The method involves having each test participant engage in an interaction with the aforementioned words in a baseline syntax to achieve service-specific tasks. The user works to complete the tasks, and data are collected on key variables such as task duration, barge-in frequency and location, and throughput rate. Data also may be collected through interviews and questionnaires. It is preferred that alternative prompting structures are tested and compared to isolate the effects of syntactic changes.

[0061] The basic realization of the approach enables selection of the best words the subscriber should say to the service, and construction of the best prompts that the service should say to the subscriber. The approach is general across all spoken languages, encompasses language and cultural universals, and applies to any voice activated service. Voice Control of Voice Mail (VCVM) is used herein to illustrate the VA UI design techniques of the present invention since it provides significant complexity in which to reference VA UI instantiations. However, persons of ordinary skill in the art will readily appreciate that the examples described herein can be easily applied to other VA applications by following a similar methodology.

[0062] The principle of decoupling the semantic and syntactic parts of the UI also provides advantages when applied to error handling. In a further aspect, the invention provides an adaptive error handling and error correction method that employs a general error-handling paradigm of notification, status, and solution, with its own syntax and semantics. As a further embodiment of semantic and syntactic decoupling, the method treats errors of omission and errors of commission separately.

[0063] Semantic Structure

[0064] A significant and unusual aspect of the present invention is a method for designing a voice command vocabulary, or “wordset,” (for voice recognition) with command words chosen to make the VA UI both reliable and easy to use. This method addresses the wordset semantics of the UI and balances common (natural) usage and acoustic (recognition) differentiation. Specifying the vocabulary word set semantics for a VA service begins by addressing the often-conflicting criteria of user acceptance and acoustic discrimination. The process utilized here is to identify command words by asking speakers of the target language indirect questions and to receive responses that contain words most likely to be spoken by service subscribers to invoke a service feature or function.

[0065] The design of the semantic component therefore begins with a second level of decoupling in which pure semantics are separated from acoustic analysis. This enables a set of optimal choices for the overall vocabulary set to be specified. The resulting, theoretically semantically optimal vocabulary set is then re-combined with acoustics and optimized in the context of sub-vocabulary recognition accuracy.

[0066] An embodiment of the method may proceed according to the following outline. First, a basic set of questions in the target language is prepared. The questions are designed to elicit responses that are words (or short phrases) commonly used to identify outcomes or commands for the target VA application. Frequent responses are selected as likely command words and grouped into subvocabularies corresponding to the various command menus of the service application. Acoustic analysis of each subvocabulary identifies pairings that may present problems for acoustic differentiation, and appropriate substitutes are selected from the list of semantically equivalent responses. Some vocabulary words occur in multiple subvocabularies, so analysis is performed for each word across all applicable subvocabularies.
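The outline above, applied to a single subvocabulary, can be sketched in Python as follows. The sketch assumes that the acoustic analysis is available as a `confusable(a, b)` predicate supplied by the caller, which is hypothetical, and that when every candidate for a function conflicts, the most frequent response is kept anyway. Neither detail comes from the text.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional


def rank_responses(responses: List[str]) -> List[str]:
    """Return distinct responses ordered from most to least frequent."""
    counts = Counter(r.strip().lower() for r in responses)
    return [word for word, _ in counts.most_common()]


def select_subvocabulary(
    responses_by_function: Dict[str, List[str]],
    confusable: Callable[[str, str], bool],
) -> Dict[str, Optional[str]]:
    """Pick one command word per function for one menu's subvocabulary.

    The most frequent response is preferred; if it is confusable with a
    word already chosen, the next most frequent alternative is substituted.
    """
    chosen: Dict[str, Optional[str]] = {}
    for function, responses in responses_by_function.items():
        candidates = rank_responses(responses)
        pick = next(
            (w for w in candidates
             if not any(other is not None and confusable(w, other)
                        for other in chosen.values())),
            None,
        )
        if pick is None and candidates:
            pick = candidates[0]  # assumption: fall back to the most frequent word
        chosen[function] = pick
    return chosen
```

A word that appears in several menus would need this check repeated for each subvocabulary it belongs to, as the paragraph notes; the sketch handles one subvocabulary at a time.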

[0067] FIG. 3 illustrates a flow diagram, corresponding to block 30 in FIG. 2, that details implementation of the invention to select a preferred vocabulary set for the target application and the target community. The illustrated method encompasses operations for vocabulary testing, followed by acoustic differentiation. The goal of the sequential method is to identify a final set of most likely words that can be expected to be spoken in the target language as commands to the voice-activated service.

[0068] The goal of the vocabulary test is to identify a final set of most likely words that can be expected to be spoken in the target language as commands to the voice activated service. Here, “word” means a word or phrase that is spoken to indicate an integral task concept. For example, “erase” may be used to instruct the system to erase a message just recorded, while “skip password” may be used to instruct password verification to be omitted. Thus, in this description the technical term “word” is not limited literally to single words in the target language.

[0069] In the following description, occasional reference will be made to a voice-controlled voice mail (VCVM) service as an example VA application. These references to the VCVM service are purely for purposes of concrete examples and are not intended to imply that the present invention is limited to voice mail services. Rather, as noted above, the invention provides a universal framework applicable to all voice activated services. Examples of such services in telecommunications fields include personal assistant, voice activated dialing, directory assistance, reverse directory assistance, call routing, switch-based feature activation, and so forth. The invention also has application to voice activated services in other areas of commerce and industry, as will be apparent to those of skill in the art.

[0070] The first stage of the illustrated method, at block 310, is to select those command functions of the target application for which command words will be specified using vocabulary testing. It is preferred, for cost effectiveness of the design process, that only command functions meeting certain criteria be specified by testing in the target language. The selection process of block 310 will now be explained.

[0071] FIG. 4 illustrates a set of sub-menus and command words (in American English) for the exemplary VCVM service. The target service for the VA UI imposes constraints on the set of words which may be used to execute the service. The set of words used in the non-VA service, such as the command words illustrated in FIG. 4, provides an initial guess at the target words to be investigated.

[0072] This base set may be composed of the existing key words used in the call flows. By identifying the key words and looking at each call flow of the service, a table can be made which lists the word and the call flow in which the word is used. The base words are then rank-ordered according to frequency of use in the service. This provides a quantitative measure (also called a “local performance indicator,” or “local PI”) by which a cost-benefit analysis can be performed. The cost-benefit analysis identifies the base words for which target-language specification is expected to have the greatest impact on the service. In other words, specification of these high-ranking (i.e., most frequent) words will provide the greatest benefit in usability of the VA UI for the fixed cost to obtain each specification result.
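The rank-ordering step described above can be sketched as follows. This is a minimal illustration only: the call-flow names and word lists are hypothetical, and "frequency of use" is assumed here to mean the number of call flows in which a base word appears, which the description does not state explicitly.

```python
from collections import Counter

def rank_base_words(call_flows):
    """Rank base words by the number of call flows that use them.

    call_flows maps a (hypothetical) call-flow name to the key words it uses.
    Returns (word, count) pairs, most frequent first.
    """
    counts = Counter()
    for words in call_flows.values():
        # Count each word once per call flow in which it appears.
        counts.update(set(words))
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Illustrative data only; not taken from FIG. 4.
call_flows = {
    "main_menu": ["review", "send", "options"],
    "end_of_message": ["erase", "save", "reply", "review"],
    "mailbox_options": ["greeting", "passcode", "options"],
}
ranking = rank_base_words(call_flows)
```

The head of `ranking` then gives the base words for which target-language testing would be expected to pay off most under this assumed measure.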

[0073] It has been found that words in the base set tend to aggregate into three major categories, which are termed “universal,” “uncertain,” and “distributed.” Universal base words are those for which test responses are found to be limited substantially to a single word. Uncertain words are those for which the test responses are more-or-less equally divided across many choices. Distributed words are those for which the test responses show one clear preference, while other viable alternatives remain that can also be used.
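One way to make the three categories concrete is to classify a base word by the share of responses captured by its most frequent response. The sketch below is illustrative only; the function name and both threshold values are assumptions, since the description gives no numerical cut-offs.

```python
def classify_base_word(tallies, universal_share=0.9, preference_share=0.5):
    """Classify a base word from its response tallies.

    tallies maps each response word to the number of subjects who gave it.
    The two thresholds are hypothetical values chosen for illustration.
    """
    total = sum(tallies.values())
    if total == 0:
        raise ValueError("no responses to classify")
    top_share = max(tallies.values()) / total
    if top_share >= universal_share:
        return "universal"    # responses limited substantially to one word
    if top_share >= preference_share:
        return "distributed"  # one clear preference, with viable alternatives
    return "uncertain"        # responses spread across many choices

categories = [
    classify_base_word({"erase": 19, "delete": 1}),
    classify_base_word({"erase": 12, "delete": 5, "remove": 3}),
    classify_base_word({"a": 3, "b": 3, "c": 2, "d": 2}),
]
```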

[0074] Base words that are universal or uncertain need not be included in the semantic testing, and therefore the cost of the semantic analysis for those words can be avoided. Instead, for a universal word the preferred procedure is to use the (single) response word as defined by the subscribers. For an uncertain word, the vocabulary word is preferably selected by the service developer from the available responses. This leaves the distributed words of the base set as the command functions selected for specification at block 310.

[0075] The next stage, at block 315 of FIG. 3, is to prepare questions for the vocabulary testing. These are very simple, general, spoken questions that are posed to volunteer members of the target community. The questions are translated and presented in the target language and are designed to elicit responses from the test subjects (the volunteers) that will be candidates for the final vocabulary set. The objective is to ask questions that will be answered with words commonly used by members of the target community to indicate the application-specific commands or items of the target application. Here a “question” is a request for a response, irrespective of whether the request is formed as a literal interrogatory statement. For example, an imperative statement beginning “Please state the word you would use for . . . ” would constitute a “question” for the present purposes.

[0076] An example set of such questions, adapted for use with the exemplary VCVM application discussed herein, is illustrated in FIG. 5. The preferred question format follows a scenario/goal paradigm. For example, the question may describe a scenario relating to a specific task, and then specify a goal related to that task. In a preferred form of the questions, a short introductory statement orients the listener to the nature of the task.

[0077] The questions are designed to elicit responses relevant to the target application. Thus, the example questions in FIG. 5 relate to functions and tasks ordinarily performed with a voice mail application. The questions are preferably ordered according to difficulty, with easy questions in the beginning, so that the test subjects build confidence as they perform the test. Also, it is desirable that similar questions not be located close together in the question sequence.

[0078] It is preferred that the questions be purposely formulated to be vague, in order not to pre-dispose the subject to selection of words that are used in the prompting questions. This helps to ensure that the subject does not merely “parrot” words that are heard in the particular prompt or in a previous question. It is also preferred that the questions be open-ended, rather than multiple-choice. The open-ended format has the advantage of forcing the subject to formulate an original response, rather than merely choosing from a list.

[0079] A second stage of the question preparation, after the questions have been formulated and translated, is a pilot test to refine the questions prior to the primary vocabulary testing. The purpose of the pilot test is to finalize the word-set questions by identifying and eliminating any confusing aspects. This ensures that the final word-set questions have no ambiguity and are readily understood. Preferably the pilot test comprises presenting the questions to a few subjects (for example, 4-5 members of the target community) in the target language. A tape recorder may be used to record the questions and responses for later, more detailed analysis. Also, the test questions may be followed by post-test interviews.

[0080] A native speaker then records the questions onto a computer running in data collection mode using a “voice form” IVR application. In a particularly preferred implementation, the test system includes a set of telephones accessing a TRILOGUE™ computer, by Comverse Network Systems, Inc. The TRILOGUE™ computer has multiple active incoming channels and typically runs a set of, for example, 30 data boxes (“D-boxes”) in linked mode to support 30 vocabulary test questions. It is preferred that the recordings be prepared after the pilot test and any appropriate clarification of the question set.

[0081] Returning to FIG. 3, the next phase of the method is the main vocabulary testing at block 320. This trial includes presentation of the test questions to a group of subjects from the target community and collection of responses. In the preferred implementation the trial participants (preferably at least about 30, and more preferably up to 50 or more to tighten the confidence intervals of the results) call in to the platform and listen to the questions. Each question is a prompt that invites the participant to speak a response. In the preferred implementation the responses are recorded by the trial platform. In any case, the presentation of the questions and collection of responses generates the raw data to be used for semantic analysis.
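The effect of the number of participants on the confidence interval can be illustrated with a short calculation. The sketch assumes a normal-approximation interval for a response proportion, which is a common textbook choice and not a method stated in this description; the function name is hypothetical.

```python
import math

def half_width(p, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a
    proportion p observed among n subjects (z=1.96 for about 95%)."""
    return z * math.sqrt(p * (1.0 - p) / n)

# Worst case p = 0.5: compare 30 and 50 participants.
hw_30 = half_width(0.5, 30)
hw_50 = half_width(0.5, 50)
```

Under this assumption, moving from 30 to 50 subjects narrows the interval from roughly plus or minus 18 percentage points to roughly plus or minus 14.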

[0082] Various refinements may be included in the off-line semantic analysis (blocks 325-360).

[0083] After the data collection is completed at block 320, the translator may assist in isolating the words spoken and placing them in a spreadsheet so that a word frequency analysis can be performed. The translator preferably identifies words that are slang or uncommon. Normal conversations also include thought-transitioning sounds, such as interjections, conjunctions or vocalized pauses. These “non-content” words and utterances are preferably identified in the responses for each question and removed from the word frequency analysis.

[0084] The final tallies in the frequency distribution (block 325) represent the likelihood of occurrence of a word in response to the question or to similar questions. The candidate word selection (block 330) may employ a few basic rules for identifying preferred words based on the tallies.
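The tallying of block 325, with non-content utterances removed first, might be sketched as follows. The filler list and the sample responses are hypothetical; in practice the translator would supply the non-content tokens for the target language.

```python
from collections import Counter

# Hypothetical non-content tokens (assumption for illustration).
NON_CONTENT = {"um", "uh", "well", "and", "so"}

def tally_responses(responses):
    """Count content responses to one question, ignoring filler tokens."""
    counts = Counter()
    for response in responses:
        tokens = [t for t in response.lower().split() if t not in NON_CONTENT]
        if tokens:
            # A multi-word phrase is tallied as a single "word".
            counts[" ".join(tokens)] += 1
    return counts

responses = ["Erase", "um erase", "delete", "uh well delete", "erase", "uh"]
tallies = tally_responses(responses)
```

A response consisting only of filler, such as the last one here, contributes nothing to the distribution.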

[0085] A word spoken by all the subjects is most certain to be the preferred word for that particular task (“universal”, as defined earlier). On the other hand, a small number of semantically equivalent responses to a question indicates no clear preference (“distributed”, as defined earlier). In the latter situation the response word with the best recognition accuracy is selected. When there are many different responses to a question, no preference is shown and the word is chosen to advantage the ASR engine (“uncertain”, as defined earlier).

[0086] With this approach, it is possible to identify for each sub-menu command words that are both easy to use and have high recognition accuracy. When the selected words are mapped onto the anticipated menu structure of the target application (block 335 of FIG. 3), some menus may have words with acoustically similar pronunciations, or words of short length, or both. Each of these conditions will adversely impact recognizer performance, and so their effects must be reduced.

[0087] For example, a token adjective may be added to a short word, thereby reducing the potential for confusion with a similar short word in the same sub-menu. Preference may be given to a word that is more common, even though another word is semantically equivalent, which would maintain consistency with a selection for an earlier sub-menu. Occasionally a word is proposed because it is the best semantic match (closest equivalent meaning) in the target language for the command functionality.

[0088] Semantic optimization is performed off-line. First, at block 325 of FIG. 3, a frequency distribution is generated for the collected responses. An analysis is then performed on the word frequencies at block 330, which enables the selection of frequently-occurring responses as likely command words for the VA UI. The selected responses provide a preliminary, target-language vocabulary for the interface. The selected candidate words are then divided at block 335 into appropriate sub-vocabularies (compare the various menus shown in FIG. 4).

[0089] It is noted that the candidate words selected at block 330 are only likely command words for the VA UI. In fact, it is preferred that the selection procedure of block 330 include selection of alternatives to the preferred candidate words. This is because some of the selected candidate words may have acoustic similarities to other candidate words in the same sub-vocabulary.

[0090] For each sub-vocabulary, a basic acoustic analysis is performed at block 340 to quantify any acoustic similarities and to identify words that must be reviewed. An example format for the basic acoustic analysis will be presented below. If a pair of words is found to be acoustically similar (YES at block 345), then the method proceeds to block 350 where an alternative for at least one of the similar words is selected. The procedure then returns to block 340 for basic acoustic analysis of the sub-vocabulary including the substituted alternative word(s).

[0091] The sub-vocabularies are tested in the order of most likely frequency of usage. For the example VCVM, the Main Menu is tested first, then the End of Message Menu, then Mailbox Options Menu, etc. Each time a sub-vocabulary passes the acoustic similarity test at 345, the method advances to the next sub-vocabulary (block 355) until no more sub-vocabularies remain to be tested (block 360). This leads to definition of a final vocabulary, which is then proposed for more comprehensive acoustic analysis in view of the specific ASR engine (i.e., the speech recognizer) to be used in the VA UI. The comprehensive analysis, in turn, validates the final vocabulary word set as satisfying the system performance criteria for recognition accuracy.
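The loop through blocks 340, 345 and 350 for a single sub-vocabulary can be sketched as below. Everything named here is hypothetical: the function `differentiate`, the data layout, and in particular the toy `is_similar` test (shared final three letters), which merely stands in for the basic acoustic analysis.

```python
from itertools import combinations

def differentiate(sub_vocab, alternatives, is_similar, max_rounds=20):
    """Replace acoustically similar words until no similar pair remains.

    sub_vocab:    command function -> currently chosen word
    alternatives: command function -> ordered fallback words (block 330)
    is_similar:   callable(word_a, word_b) -> bool (stand-in for block 340)
    """
    chosen = dict(sub_vocab)
    remaining = {f: list(alts) for f, alts in alternatives.items()}
    for _ in range(max_rounds):
        clash = next(
            ((f1, f2) for f1, f2 in combinations(chosen, 2)
             if is_similar(chosen[f1], chosen[f2])),
            None,
        )
        if clash is None:
            return chosen  # NO at block 345: sub-vocabulary passes
        # Block 350: substitute an alternative for one word of the pair.
        for function in reversed(clash):
            if remaining.get(function):
                chosen[function] = remaining[function].pop(0)
                break
        else:
            raise ValueError(f"no alternatives left for {clash}")
    raise RuntimeError("similarity not resolved within max_rounds")

# Toy similarity test, for illustration only.
def toy_similar(a, b):
    return a[-3:] == b[-3:]

final = differentiate(
    {"listen": "play", "repeat": "replay", "remove": "erase"},
    {"listen": ["listen"], "repeat": ["repeat"]},
    toy_similar,
)
```

Running the same routine over each sub-vocabulary in order of expected usage would correspond to blocks 355 and 360.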

[0092] In a preferred implementation of the basic acoustic analysis (block 340), a phonetic transcription is first performed using the ARPABET representation. Common alternative pronunciations can also be included. Rules that characterize the potential types of deletion and substitution errors are applied.

[0093] In a particularly preferred embodiment, the deletion rules applied between two words may be as follows: 1=same number of syllables in a pair of words; 2=exact vowels in a pair of words; 3=exact vowel in identical syllables in a pair of words. Also, the substitution rules applied between two words in this preferred implementation may be as follows: 1=identical phonemes anywhere in the words; 2=identical phoneme in the same syllable position; 3=identical vowel in the same syllable; 4=identical vowel with the identical phoneme context; and, 5=identical phonemes and the same vowel in the same syllable. If any sub-vocabulary word-pair contains more than one full set of rule matches, the pair is considered a candidate for modification using alternative words determined from the initial semantic testing and analysis.
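A few of these rules can be illustrated on hand-written ARPABET transcriptions. The sketch below is a simplified reading and does not reproduce the full rule set or its scoring: it assumes one syllable per vowel phoneme, checks only deletion rules 1 and 2 and substitution rule 1, and uses illustrative transcriptions without stress marks.

```python
# Common ARPABET vowel symbols (stress digits omitted).
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}

def vowels(phonemes):
    """Return the vowel phonemes of a transcription, in order."""
    return [p for p in phonemes if p in ARPABET_VOWELS]

def rule_matches(word_a, word_b):
    """Evaluate three simplified similarity checks for a word pair."""
    va, vb = vowels(word_a), vowels(word_b)
    return {
        # Deletion rule 1: same number of syllables (one per vowel, assumed).
        "same_syllable_count": len(va) == len(vb),
        # Deletion rule 2: exactly the same vowels, in the same order.
        "same_vowels": va == vb,
        # Substitution rule 1: phonemes that occur in both words.
        "shared_phonemes": sorted(set(word_a) & set(word_b)),
    }

# Illustrative transcriptions (assumptions, not from the description).
save = ["S", "EY", "V"]
say = ["S", "EY"]
delete = ["D", "IH", "L", "IY", "T"]

close_pair = rule_matches(save, say)
distant_pair = rule_matches(save, delete)
```

A pair such as "save" and "say" matches on every check and would be flagged for review, while "save" and "delete" match on none.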

[0094] A preferred enhancement of the basic approach is to tune the final vocabulary to the target language and the target service application. In this procedure, consistency of usage throughout the UI may be considered for both grammatical forms and phrase structures. For example, candidate words may be considered to describe actions which can be taken to manage a group list in the example VCVM application. In this situation it may happen that the frequency analysis (block 325) reveals no strong preference among the test subjects for words to be used.

[0095] In this example situation it is possible that a word choice translating literally to “group list” or “options list” may be passed over in favor of a word meaning “distribution list.” One reason for the latter choice would be to maintain semantic equivalence with the English language counterpart. Similarly, a command may be modified to a different verb form (viz., progressive tense vs. infinitive) to maintain consistent usage of verb forms (action words) where possible.

[0096] Some words may be specific to the application and have no clear counterpart in the target language. In such cases, command words may be selected as those commonly understood, even though semantic equivalence to the functionality being named is less than perfect. Article adjectives may be added when the impact on the length of the transaction is slight compared to the amount of clarity or user friendliness added. In some languages, adding a particular article may make the actual utterance longer and hence more attractive for an interrupt word.

[0097] Using the same word across multiple menus reduces the cognitive load on the user, because the word then refers to the same concept and consequently leverages the user's comprehension and learning from an earlier menu state into later states. For example, using a word as a command to enter a menu and then having the same word announced (in echo) as the name of that menu is considerable positive reinforcement to the user. Similarly, parallelism may be used advantageously to reinforce similarities between objects of commands. For example, a word used to name the primary fax telephone number may indicate in one menu that the digits are to be entered, while in another menu the same word may indicate that the number is being used for message transfer.

[0098] Syntactic Structure

[0099] Returning to FIG. 2, the second design track (which can proceed in parallel with the semantic analysis outlined above) is to identify an effective syntax. The present invention provides a method for defining a formal structure (called a “syntax”) that includes the temporal rules and prompting manner to be used in the VA UI. Here a “syntax” for a VA UI is defined to be a structure for sequential presentation of spoken information defined by a set of rules. A conversational syntax may be implemented in a VA UI for a target service and a target language by specifying a prompt structure according to a set of grammatical rules. The components of the structure include context cues (e.g., menu labels), carrier phrases (explicit or implicit), slots (places) for words, intervals for pauses between words and phrases, intonation contours for words and phrases, and other prosodic features.

[0100] The existing approaches to VA UI design have failed to recognize that improved VA UI performance can be realized by identifying and taking advantage of those areas where the verbal modality of dialogue differs from the written modality in the element of time. An effective VA UI must prompt the user in a manner that both provides easy-to-understand information for response and signals when it is time for the user to respond with commonly used command words. Recognition of these requirements leads to a framework within which to consider and implement cues for “turn-taking,” that is, grammatical and temporal indications that a unit of information has been conveyed and an opportunity for response will follow.

[0101] Any baseline syntax may be constructed with tokens (words) having some semantic applicability to the service. However, syntactic parameters are more accurately specified for a target language if the semantic content is chosen as described above, so as not to add additional cognitive load to the user who is intended to react to the manner in which the message is provided by the baseline syntax. Optimization of the UI in view of this coupling is described below. For a specific language, the temporal structure (the syntax) itself requires specification of pace, pauses, intonational cues, and means to present information “chunks” (e.g., groups of options). Every language and culture follows some conversational universals, but speed of presentation, length of turn-taking pauses, and clause (e.g., chunk) intonation all vary in different degrees between specific languages. Pauses are significant for at least two reasons: they mark out the boundaries of informational chunks, by highlighting beginning and ending, and they signal turn-taking positions in the dialogue.

[0102] The method provided by the invention for optimizing syntax may be implemented with a specialized simulation environment in which the simulator performs perfect speech recognition. This approach is preferred because signal recognition issues (ASR) can be thereby decoupled from the user interface issues posed by the prompting structure. One desirable platform for such simulations is VISIO Professional and Technical 5.0 by VISIO Corp., of Seattle, Wash. Another simulation tool with excellent features is Unisys Natural Language Speech Assistant 4.0 by Unisys Corp., of Malvern, Pa.

[0103] The test prompts of the service provide the information to complete the tasks and to achieve the goal. Non-service specific tasks are also presented if they embody a prompt structure similar to the specific service, so as to de-couple service dependence while addressing the spoken syntax of the target language and culture. Turn-taking locations, content of the verbal information, rate of presentation, grouping of options, and pause durations are implicit cues given to the subscriber by the test prompts.

[0104] FIG. 6 shows a flow diagram for a method of the invention for identifying an initial temporal syntax. The illustrated method permits modification of parameters in order to accommodate language dependencies. At blocks 610-640, samples of dialogue are collected relating to service tasks for the target application. In a preferred implementation, tasks are posed for achieving a set of service-specific goals.

[0105] In a particularly preferred embodiment, as illustrated in FIG. 6, the subjects are requested to perform several tasks calling for spoken responses. Different speaking tasks may emphasize different parameters for the temporal dimension of spoken dialogue in the target language. The purpose of these tasks generally is to generate (for capture and analysis) samples of conversational speech containing phrase parts or other speech elements from the target language that contain temporal features that contribute to clear, concise dialogue. It is preferable that several versions of each speaking task be performed by each of a group of subjects (10 to 50 or more). Larger numbers of versions for each task and larger numbers of subjects will tend to yield more accurate initial estimates of the optimal values for the speech parameters of interest in the target language.

[0106] At block 610 of FIG. 6, the task is to respond to an open-ended question with a suitable sentence that should contain a carrier phrase, such as (in English) “How would you request someone's telephone number?” or, “How would you say that you didn't hear the telephone number?” Samples of such statements provide initial estimates for overall pace and rate of presentation in a comfortable yet effective dialogue in the target language.

[0107] At block 620, the target task is to recite a list of items, generally having greater than 5 items, in response to an open ended question such as “Say the colors of the rainbow.” Each response by a test subject is a spoken recitation of a list and provides sample data containing durations and locations of pauses in such a spoken list in the target language. It is preferable that several different lists having various numbers of commonly known items (e.g., fruits, trees, cities) be requested from each subject. The request prompts may be written or spoken. Spoken prompts are preferred so as to promote spontaneous and natural speech patterns. If written, the request prompts may use different punctuation between the list items (e.g., items separated by commas, or semicolons) to test for context variations that affect the manner in which such lists are spoken. The request prompts for different lists are preferably interspersed with each other and with request prompts for other tasks (“shuffled”) to test for inter-recitation dependencies.

[0108] The target task of block 630 is to have test subjects say a telephone number. It is preferable that responses are collected to open ended questions such as “Please say your office telephone number.” Alternatively, or in addition, the subjects may be requested to recite currency amounts or other numerical quantities that may be used in typical conversations in the target language. Further, at block 640, the subjects are presented with sentences that contain a question probing for a yes/no format (e.g., “What would you say to someone if you're not sure whether they said yes or no?”). The spoken responses provided for each of blocks 610-640 are collected (e.g., recorded) for analysis.

[0109] The task requests of blocks 610-640 are considered to provide particularly preferred procedures for effectively identifying the primitive, temporal “phrase” parts of the types of “sentences” in the target language that are likely to be spoken in a dialogue with the VA UI. The temporal components of such questions provide the initial parameter values that are specified in the initial temporal syntax.

[0110] At block 640, based on the response of the key variables, desirable values of the syntax parameters are identified. A consistent set of the desirable parameter values is selected at block 650, whereby the initial syntax is specified.

[0111] Integration and Optimization

[0112] Again returning to FIG. 2, the prompts used by the service (i.e., the outputs from the VA UI to the subscriber) are preferably tested again at block 70 after being integrated with the final vocabulary set. The combined syntax and semantics, now adapted for the target language, can then be optimized at block 80. The objective of this joint optimization is to ensure that each prompt of the syntax structure reliably elicits from the subscriber a spoken command included in the words of the relevant sub-vocabulary wordset.

[0113] As in the procedure for selecting an initial syntax, the prompts are preferably tested in the entire service task domain to ensure appropriate interpretation by the subscriber. The subjects are tested using the language-specific temporal syntax to verify that, for the entire service, functions are reliably executed in the easiest and most efficient manner. Similar to the protocol for selecting the initial syntax, the subjects may be asked to complete several realistic tasks that exercise all major call paths of the service. The tests may be videotaped for subsequent review and quantification of results in areas where performance may be improved.

[0114] The preferred prompt testing for dialogue sample collection may be implemented by the following protocol. A “session” is a service interaction where a user is directed to achieve specific tasks, works to complete tasks, and receives a questionnaire for comments on how well the syntax helped complete the tasks. Subjects are usually videotaped for later review. After completion of a set of tasks, relevant variables are measured and performance values determined. Questionnaires may be analyzed for additional information.

[0115] A task set is preferably composed of two tasks, one performed after the other, with the first task testing basic functions and enabling learning to take place. The second task is more complex and allows measurement of learning effects. Each task may be composed of a set of from 2 to 6 subtasks. The subtasks in Table 1 below are typical of activities required in the exemplary VCVM service:

TABLE 1

Sub-task   Description
1.         Review messages and save or delete them, if the name and telephone number is present.
2.         Transfer a specific message to another mailbox.
3.         Change the greeting.
4.         Change the passcode.
5.         Review messages in linked-listen mode.
6.         Speak a “wake-up” word to interrupt playback of a message.
7.         Correct a small number of simulated speech recognition errors.

[0116] Key variables (performance indicators) are preferably tracked through each testing session. Identifying how these variables change in different conditions determines parameter settings for best overall system performance. A preferred set of key variables to be tracked in the testing sessions is set forth in Table 2 below. The key variables relevant to initial syntax identification are primarily those of the first category, for user interface issues. Key variables of the second category (for ASR issues) may also be tracked in the testing and are relevant at later stages of the VA UI development process.

TABLE 2

User Interface

Task duration measures the amount of time spent attempting to achieve the goal(s). It excludes time spent listening to messages, a greeting or a passcode.

The barge-in location and frequency are tracked. Barge-in indicates sufficient information for a decision, and turn-taking. These areas are improved by refining turn-taking cues, and providing better collateral materials or on-line tutorials.

Throughput rates (successful completion of the transaction) measure task completion. Error handling is examined at points of failure.

Interviews are performed during and after any taskset to identify specific points of trouble with the service, what the user was trying to do, and how they failed.

Questionnaires measure satisfaction, and potential problem areas and user needs. Surveys provide direct information from users and a means to track trends in satisfaction.

ASR Technology

Rejection rates and recovery from rejections are monitored.

ASR Error Type and location are logged. A user may mis-speak, have a bad accent, or say the wrong word, or background noise may become too loud. Analysis indicates whether ASR technology needs re-tuning or parameter resetting, or whether a speaker fault occurred.

The frequency and location of OVW responses are tracked to determine whether the rejected words can logically be used as responses.

The location of yes/no questions, and the responses, are tracked. A yes/no question impacts throughput. Yes/no questions may also indicate that the recognizer is having trouble. Synonyms for yes or no responses are tracked, and even captured so these words may be supported.

[0117] Depending on the degree of development of the service (software and hardware availability) testing may be performed by a simulation (in vitro). The simulation environment (also called a Wizard of Oz, or “WOZ” simulation) decouples the ASR technology from the VA UI. This means that the simulator (the “wizard”) acts as a perfect recognizer, thereby focusing the subject on only task-specific actions toward achieving the application goals. Preferably the simulator allows ASR errors to be injected into the simulation at later stages, in order to observe user actions and to test UI support of error handling. Such testing with controlled ASR errors helps to ensure that the user will be brought back into a successful service execution path when an error occurs in the deployed system. A simulation wizard may be used, as described above with respect to syntax optimization. Alternatively, testing may be performed on a trial platform (in vivo) that includes ASR technology and couples ASR performance back into the service.
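The idea of a perfect recognizer with controlled error injection can be sketched as a small stand-in class. The class name, the `error_rate` parameter and the substitution behavior are assumptions made for illustration; in an actual WOZ session a human operator, not code, plays the recognizer.

```python
import random

class WizardRecognizer:
    """Stand-in for a WOZ 'perfect recognizer' with optional injected errors."""

    def __init__(self, vocabulary, error_rate=0.0, seed=0):
        self.vocabulary = list(vocabulary)
        self.error_rate = error_rate        # 0.0 means perfect recognition
        self.rng = random.Random(seed)      # seeded for repeatable sessions

    def recognize(self, spoken_word):
        """Return the recognized word, or None for a rejection."""
        if spoken_word not in self.vocabulary:
            return None  # out-of-vocabulary input is rejected
        if self.rng.random() < self.error_rate:
            # Injected substitution error: return some other in-menu word.
            others = [w for w in self.vocabulary if w != spoken_word]
            if others:
                return self.rng.choice(others)
        return spoken_word

menu = ["review", "send", "options"]
perfect = WizardRecognizer(menu)                   # early-stage testing
faulty = WizardRecognizer(menu, error_rate=1.0)    # every input is misrecognized
```

Setting `error_rate` to an intermediate value at later stages would allow the error-handling prompts to be exercised at a controlled frequency.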

[0118] In either case, it is preferred that a small number of volunteers (10-15 subjects) be tested to identify any difficulties. The tests are videotaped for subsequent review and quantification of areas of syntactic and semantic performance shortcomings. Tests to determine the values of syntactic parameters are posed to a set of subjects interacting with the service through a set of service-specific goals where the prompts of the service provide sufficient information to complete the tasks to achieve the goal. Turn-taking locations, content of the verbal information, rate of presentation, grouping of options, and pause durations are implicit cues given to the subscriber. The values of these parameters are varied through the tests in order to improve performance in the person-machine VA UI for the target language as used by the target community. Often, competing prompting grammars are tested to isolate the effects of syntactic changes.

[0119] Adaptive Prompting Method

[0120] The present invention also provides a new and unique syntactic structure that actually turns to advantage the temporal limitations of voice activated services. Heretofore, all UIs have utilized a simple, serial syntax in which options are stated iteratively, one at a time, and responses are requested only one at a time. The present invention utilizes a syntactic structure that supports presentation of a small set of multiple (parallel) options, from which the user can select a desired choice by saying the corresponding command word from the current context at any time.

[0121] The invention provides a general syntactic structure (or “temporal template”) that includes combining temporal and grammatical cues to signal those points where turn-taking is expected to occur. The speech recognizer can be active at all times, so the subscriber may actually speak at any time. However, the template increases the accuracy of the ASR technology, and permits identifying and taking advantage of the resource's duty cycle, by predicting speech inputs by the user at specific time intervals through use of turn-taking cues.

[0122] FIG. 7A illustrates a preferred prompt grammar template 700 as provided by the invention. The template begins with a short, spoken Introductory Label 710 (such as a menu name) which is designed to orient the listener. The label 710 provides a navigational cue (context) as to where the subscriber is in the overall menu structure and, to an advanced user, an association with the permitted responses. A first pause 715 is then provided of length Pause1, to allow a short interval where a response may be spoken without hearing any of the available choices of that menu. Pause 715, which will be called a “carrier phrase pause,” is used by advanced subscribers of the service who know what they want to choose at this point. The pause length Pause1, however, is not long enough to disrupt the dialogue. These culturally dependent pauses are determined by the syntactic tests described earlier.

[0123] The grammar template 700 then breaks the set of menu selections into conceptual “chunks” of between 2 and 4 choices presented as a group. This grouping of choices improves the usability of the resulting VA UI by calling into service the user's capacity for parallel association. The preferred chunk size (2-4 choices) provides a small amount of information upon which action can be taken, without overloading auditory short term memory.

[0124] A first prompting chunk 720 begins with a short carrier phrase (e.g., “You may say . . . ,” or “Say . . . ,”), then a first group of response options Chunk1 is spoken by the service. It is preferred that the group of choices for Chunk1 includes the rank-ordered, most frequently used commands for the current menu. The pacing and intonation of the chunk is typical for the target language, generally with a slight falling inflection at the end of the last word to signal a grammatical break and an opportunity to respond.
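The division of a rank-ordered menu into chunks of 2 to 4 choices can be sketched as follows. The balancing strategy (fewest chunks, sizes as even as possible, larger chunks first) is an assumption chosen for illustration; the description requires only that each chunk hold 2 to 4 choices with the most frequent commands presented first.

```python
import math

def chunk_options(ranked_options, max_size=4):
    """Split rank-ordered options into consecutive chunks of at most max_size.

    With the default max_size of 4 and at least two options, every chunk
    holds between 2 and 4 choices.
    """
    n = len(ranked_options)
    if n == 0:
        return []
    k = math.ceil(n / max_size)      # fewest chunks that respect max_size
    base, extra = divmod(n, k)       # spread the options as evenly as possible
    chunks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        chunks.append(ranked_options[start:start + size])
        start += size
    return chunks

# Hypothetical menu, already ordered from most to least frequently used.
chunks = chunk_options(["review", "send", "greeting", "passcode", "options"])
```

For the five options shown, the first chunk carries the three most frequent commands and the second chunk the remaining two.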

[0125] A second pause 725 of length Pause2 is then provided for a response by the user. It is preferred that the pause duration Pause2 be longer than Pause1 and of sufficient length to enable cognitive (decision making) processing and to provide reaction time for the user to select an option from the current chunk. Pause 725 is an implicit (syntactic) signal at a conceptual (semantic) boundary that indicates the listener may take a turn and speak. Both the falling intonation and Pause2 signal that this is a turn-taking event.

[0126] If no response is made by the subscriber, the syntax specifies that a second prompting chunk 730 be spoken by the service to present a second group Chunk2 of response options. Chunk2 preferably includes the next most frequent set of choices, after the choices offered in Chunk1. Although grammar template 700 as illustrated in FIG. 7A includes only two prompting chunks 720 and 730, it will be apparent to those skilled in the art that as many such prompting chunks may be provided as are needed to present the current menu options. All of the alternatives in the menu are covered in this manner, so that the subscriber eventually hears every available option.
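
The grouping described in the preceding paragraphs can be sketched in Python as follows. This is a minimal illustration only, not a procedure taken from the specification: the function name `chunk_options`, the default chunk size of 3, and the rule for rebalancing a trailing single option are all assumptions. The input is assumed to be already rank-ordered by frequency of use.

```python
def chunk_options(ranked_options, size=3):
    """Split rank-ordered menu options into prompting chunks of 2 to 4 items.

    `ranked_options` is assumed to list the most frequently used command first.
    """
    if not 2 <= size <= 4:
        raise ValueError("chunk size should be between 2 and 4")
    chunks = [list(ranked_options[i:i + size])
              for i in range(0, len(ranked_options), size)]
    # Assumption: a trailing chunk holding a single option is rebalanced with
    # its neighbour so that every chunk stays within the 2-4 range.
    if len(chunks) > 1 and len(chunks[-1]) == 1:
        if size < 4:
            chunks[-2].extend(chunks.pop())
        else:
            chunks[-1].insert(0, chunks[-2].pop())
    return chunks


menu = ["Messages", "Fax", "Address Book", "Settings", "Help", "Introduction"]
for number, chunk in enumerate(chunk_options(menu), start=1):
    print(f"Chunk{number}: {chunk}")
```

With the six Main Menu options used in the example of FIG. 7B, the sketch yields two chunks of three options each.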

[0127] It is preferable for the UI to be able to make available all menu-specific options at any time. General choices (cancel, help, Main Menu, for example) are preferably unstated but always available, and words from other menus may be accepted, whereby the user may directly “jump” to another menu. However, a prompting structure implementing the template encourages selection from specific chunks of options at each of the pauses 725, 735, etc. This preferential prompting has the additional advantage of allowing greater emphasis on recognition of the response options offered in the preceding prompting chunk (chunk 720 for pause 725, chunk 730 for pause 735, and so forth). This feature increases the likelihood of successful recognition for the response options most likely to be chosen at each pause.

[0128] After all options are proposed, a closure prompt 740 is spoken to indicate that all choices have been provided and a choice should be made (“please say your choice now”). This is an explicit verbal signal for turn-taking. A final pause 745 of duration Pause3 is then provided to signify yet another turn-taking boundary before the system initiates an alternative prompting style. It is preferable that Pause3 be slightly longer than Pause2, to provide more time for new users to make a final decision.

[0129] FIG. 7B provides an example prompt grammar 750 as provided by the invention and following the prompt grammar template 700. In the prompt grammar 750 an introductory segment 760 has verbal content “Main Menu” and corresponds to the Introductory Label 710 of the grammar template 700. A pause 765 corresponds to pause 715 of the template 700 and has a duration (Pause1) of 250 milliseconds (ms). A prompting chunk 770 includes a carrier phrase “Please say . . . ,” followed by a first chunk of options “Messages, Fax or Address Book.”

[0130] A second pause 775, of duration 500 ms (Pause2), is followed by a second prompting chunk 780 providing the options “Settings, Help or Introduction.”

[0131] It is noted that FIG. 7B shows additional detail for the illustrated prompt grammar by indicating the presence of pauses (of duration 250 ms, in this example) between the individual list items in each prompt chunk. These so-called “intrachunk pauses” are natural separation intervals between successive items in a spoken list. It has been found that the most effective duration for such an intrachunk pause is culturally dependent and thus is desirably adjusted when designing a VA UI for a given target community. The intrachunk pauses demark the boundaries between successive list items, just as the “interchunk pauses” 775, 785, and so forth, demark the boundaries between successive chunks of information.

[0132] It is noted that the second prompting chunk 780 omits the carrier phrase (“Please say . . . ”) that was provided with the first prompting chunk 770. However, this arrangement is not essential to the prompt grammar of the invention. For example, an alternative embodiment can use a suitable carrier phrase for the second and subsequent prompting chunks.

[0133] The example prompt grammar 750 follows the second prompting chunk 780 with a further pause 785, whose duration Pause2 in this example is again 500 ms. A closure prompt 790 contains verbal content urging the user to select an option: “Please say your choice now.” Closure prompt 790 is followed by a final pause 795 of duration Pause3, which in this example is 1000 ms and thus longer than the pauses 775 and 785.
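
The example prompt grammar 750 can be written down as a sequence of spoken segments and timed pauses. The Python sketch below is illustrative only: the `Speak` and `Pause` types and the `build_prompt_grammar` function are hypothetical names, and the sketch assumes that no pause separates the carrier phrase from the first list item. The wording and the durations (250 ms, 500 ms, 1000 ms) are those given for FIG. 7B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speak:
    text: str


@dataclass(frozen=True)
class Pause:
    ms: int
    kind: str  # "carrier", "intrachunk", "interchunk" or "final"


def build_prompt_grammar(label, chunks, carrier="Please say",
                         closure="Please say your choice now.",
                         pause1=250, pause2=500, pause3=1000, intrachunk=250):
    """Lay out label, chunks and closure prompt following template 700."""
    segments = [Speak(label), Pause(pause1, "carrier")]
    for index, chunk in enumerate(chunks):
        if index == 0:
            # As in FIG. 7B, only the first chunk carries the carrier phrase.
            segments.append(Speak(carrier))
        for position, option in enumerate(chunk):
            if position > 0:
                segments.append(Pause(intrachunk, "intrachunk"))
            last = position == len(chunk) - 1 and len(chunk) > 1
            segments.append(Speak(f"or {option}" if last else option))
        segments.append(Pause(pause2, "interchunk"))
    segments += [Speak(closure), Pause(pause3, "final")]
    return segments


grammar_750 = build_prompt_grammar(
    "Main Menu",
    [["Messages", "Fax", "Address Book"], ["Settings", "Help", "Introduction"]],
)
silence_ms = sum(s.ms for s in grammar_750 if isinstance(s, Pause))
print(silence_ms)  # total silence offered to the user across the whole prompt
```

Keeping the pause durations as parameters reflects the point made above that these values are culturally dependent and are tuned for each target community.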

[0134] FIG. 8 illustrates a flow diagram for a VA UI prompting method provided by the invention and corresponding to the prompt grammar template 700. The introduction, such as a name or descriptive title of a current menu, is announced to the user at block 810. The UI determines at block 815 whether a recognizable command is received during the pause 715 following the introduction. If pause 715 passes without a response, then the method proceeds to block 820 where a prompting chunk for the current menu is recited. Block 825 determines whether a command is received during the second pause that follows the prompting chunk of block 820. If no response is detected at block 825, the method tests at block 830 whether any more prompting chunks remain to be recited. If so, then the method returns to block 820 and the next prompting chunk is recited.

[0135] If it is determined at block 830 that no more prompting chunks remain, then the method proceeds to the closure prompt at block 835. A further test is performed at block 840 to determine whether a response has been received. If not, the method preferably switches to an alternative prompting style at block 845 and returns control of the device. Another syntactic template may include a counter in block 845 to repeat the prompting sequence starting at 815 one or more times. If any of the response detection queries 815, 825, and 840 indicates that a suitable command has been received, then the method proceeds directly to execution of the detected response at block 850 and returns.
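
The flow of FIG. 8 can be sketched as a short Python function. This is a simplified illustration under stated assumptions: `say` and `listen` are hypothetical callbacks standing in for speech output and the recognizer, `listen(pause_ms)` is assumed to return a recognized command or `None` if the pause elapses silently, and the default pause lengths are borrowed from the example of FIG. 7B.

```python
def run_primary_prompt(label, chunks, say, listen,
                       pause1=250, pause2=500, pause3=1000):
    """Walk the primary prompt grammar; block numbers refer to FIG. 8."""
    say(label)                                # block 810: introduction
    command = listen(pause1)                  # block 815: carrier phrase pause
    if command:
        return ("execute", command)           # block 850
    for chunk in chunks:                      # block 830: more chunks remain?
        say(", ".join(chunk))                 # block 820: recite a chunk
        command = listen(pause2)              # block 825
        if command:
            return ("execute", command)       # block 850
    say("Please say your choice now.")        # block 835: closure prompt
    command = listen(pause3)                  # block 840
    if command:
        return ("execute", command)           # block 850
    return ("alternative", None)              # block 845: switch prompt style


# Scripted example: the user stays silent until after the second chunk.
spoken = []
answers = iter([None, None, "Settings"])
outcome = run_primary_prompt(
    "Main Menu",
    [["Messages", "Fax", "Address Book"], ["Settings", "Help", "Introduction"]],
    say=spoken.append,
    listen=lambda pause_ms: next(answers),
)
print(outcome)
```

Because every pause is checked before the next segment is spoken, an experienced user can respond as early as the carrier phrase pause, while a new user is carried through to the closure prompt.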

[0136] FIG. 9 illustrates a flow diagram for an alternative prompting method as provided by the invention. A method of this aspect of the invention may begin at block 910 by entering a secondary prompt grammar, which in the preferred case occurs when prompting by the primary grammar fails to elicit a suitable response.

[0137] The alternative prompting style illustrated in FIG. 9 is generally targeted at new users and is provided if no action is taken after the closure prompt at block 835 of FIG. 8. Preferably this is a final prompt style that exhaustively iterates through each individual choice, one at a time, posed in a yes/no context. A further introduction to use of the system may be presented, as shown by block 915, which informs the user that this prompting syntax requires a response or the system will terminate the entire session. The goal is to forcibly evoke a very simple response from those users who are still not sure what to do, yet have heard all the available options and have not yet responded. Turn-taking is explicit and forced: a response option is presented at block 920, and a response request (“yes or no”) is stated at block 925. At block 930 it is determined whether the user has answered “yes.” If so, then the method proceeds with processing the approved option and returns.

[0138] If a “yes” response is not detected at block 930, then at block 945 it is determined whether the user has responded with “no.” In one version of the method, if a “no” response has been received at block 945, then it is determined at block 950 whether more response options exist to be offered in the yes/no format. The method returns to block 920 from block 950 if there are more options. If no further options are found at block 950, or if no response from the user is detected at block 945, then the method disconnects the user from the service at block 955 and exits. Alternatively, the method may modify the test corresponding to block 945 so that the absence of a spoken response is taken implicitly to mean “no,” and may then propose the available options in order until all options are determined to be exhausted at block 950.
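
The yes/no prompting of FIG. 9 can likewise be sketched in Python. The sketch is illustrative and rests on assumptions: `say` and `listen` are hypothetical callbacks, `listen()` is assumed to return the recognized text or `None`, the wording of the introduction is invented for the example, and any reply other than “yes” or “no” is treated as no response. The flag `silence_means_no` selects the alternative version in which silence is taken to mean “no.”

```python
def run_yes_no_prompt(options, say, listen, silence_means_no=False):
    """Offer each option in a yes/no context; block numbers refer to FIG. 9."""
    # Block 915: introduction (the wording here is an assumption).
    say("Please answer yes or no, or the session will end.")
    for option in options:
        say(option)                            # block 920: present one option
        say("Yes or no?")                      # block 925: response request
        answer = listen()
        answer = answer.strip().lower() if answer else None
        if answer not in ("yes", "no"):
            answer = None                      # treat anything else as silence
        if answer == "yes":                    # block 930
            return ("execute", option)
        if answer == "no" or silence_means_no:  # block 945
            continue                           # block 950: more options?
        return ("disconnect", None)            # block 955: no response
    return ("disconnect", None)                # block 955: options exhausted


# Scripted example: the user declines the first option and accepts the second.
replies = iter(["no", "yes"])
print(run_yes_no_prompt(["Messages", "Fax"], say=print,
                        listen=lambda: next(replies)))
```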

[0139] The primary prompting method of FIG. 8 is more efficient and easier to use than the alternative method illustrated in FIG. 9, because the former allows the user to dynamically take control of the dialogue. The explicit, forced turn-taking of the alternative method is desirable only in a limited set of situations, such as when the user is unprepared or hesitant to share control with the UI. Explicit, forced turn-taking can also be useful for handling errors, as discussed in the next section.

[0140] Adaptive Error Handling

[0141] Any VA UI must address two issues: successfully accomplishing a service-supported task, and handling errors caused by system or subscriber mistakes. Successful tasks are achieved by the subscriber saying the right words at the right time, hence by speaking valid “sentences” as determined by the syntax and semantics of the VA UI. The means for success were discussed above. The key measurements in this regard are the number of operations required to achieve the goal and the task duration.

[0142] On the other hand, over many users, errors will inevitably occur. It is therefore highly desirable for a practical VA UI to include a consistent mechanism to handle errors. There are two types of VA UI failures: system and user. System errors are generally attributable to ASR errors, which often arise from microphone misplacement, spurious background noises, and user hesitations (“er”, “uh”).

[0143] User errors arise for many reasons: the user did not hear the prompt, misheard the prompt, said the wrong word, mispronounced a word, or changed his or her mind; a background sound was interpreted as a word; and so on. The framework of syntax and semantics, as provided by the present invention, also applies to user errors. In particular, a further aspect of the invention provides for decoupling user errors from system errors and testing the user errors through the service simulation. This enables generation and analysis of UI results relating specifically to the user errors. The analysis can be looped back into the UI design process to provide further robustness against user errors and actions to remedy the errors.

[0144] User errors are generally attributable to two types of causes: misleading or incorrect prompts, and the user's reliance on an improper mental model of the service. The UI goal is to minimize errors that are preventable (minimize number of operations) and to resolve errors as efficiently and quickly as possible (minimize task time).

[0145] The invention provides error handling methods in which two main user error treatments are decoupled: treatment for errors of omission (no response), and treatment for errors of commission (incorrect response). Errors are detected through changes in the behavior of the individual. A confused subject normally exhibits increased reaction time before any new action is taken, or produces non-task-related speech (out-of-vocabulary words (OVWs), interjections, “thinking out loud”). Measured latency is used to determine timing thresholds that may trigger a “help” command.

[0146] Error correction is generally performed by speaking conceptually equivalent recovery words, such as “back-up”, “undo”, “Main Menu” or “cancel.” This results in the subject being moved backwards to the previous state or back to the start. The subject then solves the task from this new state. Different prompts may be given based on the degree of subject confusion: longer, more explicit prompts for subscribers having more trouble, as measured by repeated errors, repetition of successful tasks, or ongoing latency between spoken choices.

[0147] Errors of omission occur when the user provides no response when expected. These errors are considered to arise from syntactical failures and are addressed by reprompting with an alternate prompting structure having a simpler syntactical component. Error handling is performed by a time-out treatment that builds on the syntactic (temporal) cues, followed by reprompt, followed by eventual disconnection if no response occurs. A primary prompting syntax may be repeated. A second, more structured syntax with more clarification given in each prompt choice may be provided if the omission error continues.

[0148] FIG. 10 illustrates an aspect of the invention providing a method for handling errors of omission. A monitoring process, which may be carried out in the background, is performed at block 1010 to detect changes in the user's behavior responsive to prompts from the interface. At block 1015 it is determined whether the user has delayed providing a response beyond a predetermined timeout interval. As long as no timeout occurs, the monitoring merely continues.

[0149] If a timeout is detected (“yes” at block 1015), then an omission error is determined to have occurred and the method advances to block 1020 where an error counter is incremented. Block 1025 determines whether a predetermined error limit has been exceeded. If not, then the user is reprompted at block 1030 and monitoring continues at block 1010. For example, as noted above, the user may be given another opportunity to respond appropriately from within the primary prompting structure. In this case the error limit may be a local limit, indicating a limit for errors since the last prompt. Other types of error limits are also possible, such as a limit referencing the total number of errors that have occurred in a given dialogue session.

[0150] If the appropriate error limit has been exceeded (“yes” at block 1025), then the illustrated method proceeds to block 1035 where a secondary prompting syntax is adopted. For example, a prompting method as illustrated in FIG. 9 may be employed. At block 1040 the user is reprompted based on the secondary prompting structure. Timeout is again checked at block 1045. If the user provides an appropriate response within the applicable timeout limit (which may be different from the timeout limit applied at block 1015), then the method proceeds to block 1050 where the error counter is reset. If the error limit is other than the local limit noted above, then block 1050 may be omitted or relocated. Following block 1050, or upon a “no” determination at block 1045, the method returns to monitoring at block 1010.

[0151] If the user again fails to provide a response within the applicable timeout limit (“yes” at block 1045), then the method proceeds to block 1055 where the user is disconnected from the service. This sequence parallels the “no” determination from block 945 in FIG. 9 with flow proceeding to disconnection at block 955.
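
The omission-error handling of FIG. 10 can be sketched as follows. This is a simplified Python illustration, not the specification's implementation: `listen` and `reprompt` are hypothetical callbacks, `listen(timeout_ms)` is assumed to return `None` on timeout, and the timeout values and the error limit of 1 are assumed for the example. For brevity the sketch returns as soon as a response arrives instead of resuming background monitoring.

```python
def handle_omissions(listen, reprompt, error_limit=1,
                     timeout_ms=3000, secondary_timeout_ms=5000):
    """Handle timeouts with a local error counter; blocks refer to FIG. 10."""
    errors = 0
    while True:
        response = listen(timeout_ms)            # blocks 1010 and 1015
        if response is not None:
            return ("response", response)
        errors += 1                              # block 1020
        if errors <= error_limit:                # block 1025: limit exceeded?
            reprompt("primary")                  # block 1030
            continue
        reprompt("secondary")                    # blocks 1035 and 1040
        response = listen(secondary_timeout_ms)  # block 1045
        if response is None:
            return ("disconnect", None)          # block 1055
        errors = 0                               # block 1050: reset counter
        return ("response", response)


# Scripted example: two timeouts, then an answer to the secondary prompt.
prompts = []
replies = iter([None, None, "yes"])
print(handle_omissions(lambda timeout_ms: next(replies), prompts.append))
print(prompts)
```

A session-wide error limit, also mentioned above, would keep the counter outside this function and skip the reset at block 1050.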

[0152] Commission errors occur when the user provides an incorrect response, such as providing a recognized word that performs an undesired command, or when an appropriate word is recognized as an “out of the vocabulary” word (OVW). Such errors tend to arise from semantic failures and are addressed by “second chances” and error correction options.

[0153] FIG. 11 illustrates a method of the invention for handling commission errors. A monitoring procedure at block 1110 parallels the monitoring procedure of block 1010 in FIG. 10. At block 1115 it is determined whether the user has said a command word for a correction command. If not, then it is determined at block 1120 whether a response by the user is an OVW. If an OVW is not detected at block 1120, then the method returns to the monitoring procedure at block 1110.

[0154] If a correction command is detected (“yes” at block 1115), then the user is returned to a previous menu state at block 1125. For example, the VA UI may provide the phrase “main menu” as an escape command by which the user can back out to the main menu from any of the submenus. See, for example, the submenus shown in the example subvocabulary specification of FIG. 4. If the user says “main menu” from within a submenu, then the VA UI returns the menu state to the main menu and the user can try again to perform the desired task.

[0155] Handling of commission errors by the invention may include simply returning to the monitoring state after a correction command has been executed. However, the method illustrated in FIG. 11 includes the optional feature of incrementing a prompt level at block 1130 following menu-state return at block 1125. Error prompt levels will be discussed below with reference to FIG. 12. After incrementing the prompt level at block 1130, the method of FIG. 11 proceeds to block 1135 where the user is prompted for the current menu based on the current prompt level. The flow then returns to the monitoring state at block 1110.
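
The correction-command branch of FIG. 11 can be sketched in Python as shown below. The sketch is hypothetical in several respects: the names `DialogueState` and `handle_commission` are invented, the menu state is modelled as a simple stack, and the mapping of each recovery word to “previous menu” or “main menu” is an assumption made for illustration. The treatment of a detected OVW is not described in the passage above, so the sketch only reports it.

```python
from dataclasses import dataclass, field

# Assumed mapping of recovery words to their effect on the menu state.
CORRECTION_COMMANDS = {
    "main menu": "main",
    "back-up": "previous",
    "undo": "previous",
    "cancel": "previous",
}


@dataclass
class DialogueState:
    menu_stack: list = field(default_factory=lambda: ["Main Menu"])
    prompt_level: int = 0


def handle_commission(state, utterance, vocabulary):
    """Process one utterance; block numbers refer to FIG. 11."""
    word = utterance.strip().lower()
    if word in CORRECTION_COMMANDS:                    # block 1115
        if CORRECTION_COMMANDS[word] == "main":
            del state.menu_stack[1:]                   # block 1125: main menu
        elif len(state.menu_stack) > 1:
            state.menu_stack.pop()                     # block 1125: step back
        state.prompt_level += 1                        # block 1130 (optional)
        # Block 1135: reprompt for the current menu at the new prompt level.
        return ("prompt", state.menu_stack[-1], state.prompt_level)
    if word not in vocabulary:                         # block 1120: OVW?
        return ("ovw", state.menu_stack[-1], state.prompt_level)
    return ("monitor", state.menu_stack[-1], state.prompt_level)  # block 1110


state = DialogueState(menu_stack=["Main Menu", "Messages", "Compose"])
print(handle_commission(state, "Main Menu", vocabulary={"send", "delete"}))
```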

[0156] FIG. 12 illustrates an exemplary implementation of error prompt levels as provided by the invention. A procedure for monitoring the user's behavior is again carried out at block 1210. At block 1215 it is determined whether a user error has occurred. The error prompt levels provided by this aspect of the invention may be implemented with either omission error handling, or commission error handling, or both. If no error is detected, then the method continues monitoring at block 1210.

[0157] If an error is detected (“yes” at block 1215), then the method increments an error counter at block 1220. In the illustrated example, it is determined at block 1225 whether the error count exceeds a limit MAX. If so, then the VA UI disconnects the user from the service at block 1230. For example, the user may be disconnected if repeated prompting fails to elicit an appropriate response.

[0158] If the error limit has not been exceeded (“no” at block 1225), then the method proceeds to block 1235 where it is determined whether the error count is greater than a threshold value. In the example illustrated in FIG. 12, threshold=1. If the error threshold has not been exceeded at the current prompt level, then the method maintains the current prompt level, reprompts the user at block 1240, and returns to the monitoring procedure at block 1210.

[0159] If the error threshold has been exceeded (“yes” at block 1235), then the illustrated method advances to block 1245 where the error prompt level is incremented. The operation of block 1245 thus parallels the operation of block 1130 in the method illustrated in FIG. 11.
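
The error prompt levels of FIG. 12 can be sketched in Python as follows. The names `ErrorState` and `register_error` are hypothetical, and the value MAX=3 is an assumption (only threshold=1 is given for FIG. 12). The sketch also assumes that the user is reprompted after the prompt level is raised, by analogy with blocks 1130 and 1135 of FIG. 11.

```python
from dataclasses import dataclass


@dataclass
class ErrorState:
    error_count: int = 0
    prompt_level: int = 0


def register_error(state, max_errors=3, threshold=1):
    """Update counters after a detected error; blocks refer to FIG. 12."""
    state.error_count += 1                     # block 1220
    if state.error_count > max_errors:         # block 1225: limit MAX exceeded
        return "disconnect"                    # block 1230
    if state.error_count > threshold:          # block 1235
        state.prompt_level += 1                # block 1245: clearer prompts
    return "reprompt"                          # block 1240


state = ErrorState()
for attempt in range(4):
    action = register_error(state)
    print(attempt + 1, action, state.prompt_level)
```

In this sketch the first error leaves the prompt level unchanged, later errors raise it one step at a time, and the error after MAX ends the session.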

[0160] As indicated by FIGS. 11 and 12, a preferred embodiment of the invention provides plural error prompt levels. The invention may provide two or more prompting structures that together implement the use of more clarifying prompts at each of successive stages of user difficulty. An error counter, such as a local counter, a transaction counter, or a personal profile counter, keeps track of the number of errors which have occurred over a time interval, and lets the system take different action for different levels of error. For example, the UI may change the prompt wording to add more clarification, or break the task into simpler subtasks, or (the simplest prompt structure) pose a highly structured prompt to be answered by a yes or no response.

[0161] FIG. 13 shows a functional diagram of a standard form for system-wide error handling procedures in a preferred embodiment of the invention. A three-part procedure is followed that includes stages of notification, status, and solution. Notification can be null, non-verbal (longer silence, or an error tone sequence), or verbal (for example, “sorry”). This sets the context to indicate that an error has been detected by the system.

[0162] The status describes the type of error made (for example, “the telephone number is not correct”). Preferably this information is omitted for one-step tasks, because in such situations the type of error that has occurred is merely reiterated (e.g., “you have entered 708-555-1212”). The solution stage explains what may be done or should be done to correctly perform the task (for example, “you must enter a ten digit telephone number”).
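
The three-part form of FIG. 13 can be sketched as a small Python helper. The function name `build_error_message` and its parameters are hypothetical, and only the verbal and null forms of notification are modelled; a non-verbal notification such as a tone would be handled outside the text. The example wording is taken from the passage above.

```python
def build_error_message(solution, notification="Sorry.", status=None,
                        one_step_task=False):
    """Assemble notification, status and solution into one spoken message."""
    parts = []
    if notification:                     # notification may be null
        parts.append(notification)
    if status and not one_step_task:     # status is omitted for one-step tasks
        parts.append(status)
    parts.append(solution)               # solution: what to do next
    return " ".join(parts)


print(build_error_message(
    solution="You must enter a ten digit telephone number.",
    status="The telephone number is not correct.",
))
```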

[0163] The syntax of error handling is the sequence of operations and pauses between the operations executed (some operations may be omitted). The semantics of error handling incorporates the words and sentences provided as feedback to the subscriber. Thus, the error handling semantics may depend on the nature of a persona attributed to the service at a specific prompting level.

[0164] The terms and expressions employed herein are used as terms of description and not of limitation, and there is no intention, in the use of such terms and expressions, of excluding any equivalents of the features shown and described or portions thereof, but it is recognized that various modifications are possible within the scope of the invention claimed.

What is claimed is:
 1. A method for optimizing a voice activated user interface, the method comprising: configuring the user interface with a vocabulary of command words including at least one word indicating a corresponding task and selected from plural words for the task based on frequency of use; and changing at least one of a command and a syntax parameter of the user interface based on results of testing the user interface with speakers of a target language.
 2. A method as recited in claim 1, further comprising selecting words of the vocabulary from frequently-used words given by speakers of the target language in response to task-oriented questions.
 3. A method as recited in claim 1, further comprising: identifying an initial value for each of one or more syntax parameters of the user interface from samples of dialogue in a conversational language of a target community; and specifying an initial temporal syntax for the user interface based on the one or more identified initial values.
 4. A method as recited in claim 1, further comprising obtaining the testing results by a procedure including: posing a task set for a subject to perform using the user interface; and collecting dialogue information for the user interface when the subject performs the task set.